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ABSTRACT 

Distributed-memory matrix multiplication (MM) is a key 
element of algorithms in many domains (machine learning, 
quantum physics). Conventional algorithms for dense MM 
rely on regular/nniform data decomposition to ensnre load 
balance. These traits conflict with the irregular structure 
(block-sparse or rank-sparse within blocks) that is increas¬ 
ingly relevant for fast methods in quantum physics. To 
deal with such irregular data we present a new MM al¬ 
gorithm based on Scalable Universal Matrix Multiplication 
Algorithm (SUMMA). The novel features are: (1) multiple- 
issue scheduling of SUMMA iterations, and (2) hne-grained 
task-based formulation. The latter eliminates the need for 
explicit internodal synchronization; with multiple-iteration 
scheduling this allows load imbalance due to nonuniform ma¬ 
trix structure. For square MM with uniform and nonuniform 
block sizes (the latter simulates matrices with general irreg¬ 
ular structure) we found excellent performance in weak and 
strong-scaling regimes, on commodity and high-end hard¬ 
ware. 
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1. INTRODUCTION 

High performance parallel algorithms for matrix multi¬ 
plication (MM) — the most important special case of the 
general tensor contraction, and often its building block — 
have been studied for decades [13[ |22[ |32[ |28[ |17[ |50[ 

|23|. MM nevertheless continues to draw attention of re¬ 
searchers 1^ [M [3^ due to the 

continuing evolution of the computer hardware as well as 
the prominent role of matrix and tensor computation in a 
variety of scientific domains, such as physics of classical and 
quantum fields (most notably, electronic structure |25[ 
20 47 ) as well as data analysis and model building in ma- 
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chine learning , chemometrics , neuroscience [33| , and 
many more. A frontier challenge posed by the scientific do¬ 
main needs is the increasing importance of sparse and non¬ 
standard tensorial data representations (dense and sparse 
tensors, multiresolution spectral element trees, H matrices, 
matrix product states, tensor networks, etc.). Another ma¬ 
jor challenge is the increasing complexity of communication 
hierarchy and the continuing increase of the communication 
cost, relative to that of computation. This spurs the search 
for algorithms that minimize/avoid communication and/or 
hide its cost. Here we explore improvements of a standard 
dense matrix-multiplication algorithm that can hide com¬ 
munication costs and tolerate network latencies, data inho¬ 
mogeneity, and sparsity, which serves as a platform for the 
development of algorithms on irregular tensorial data struc¬ 
tures. 

To demonstrate the tension between standard dense MM 
algorithms and the needs of emerging scientific domains, 
consider a concrete example of chemistry and materials sci¬ 
ence. The matrices in such context represent quantum states 
(of electrons) and operators represented in some basis. Ef¬ 
ficient application of operators to states — represented by 
matrix multiplication — demands taking advantage of the 
matrix structure arising from the physical properties of the 
entities these matrices represent. Such structure could be 
(a) block-sparsity due to the distance decay of the opera¬ 
tor kernel and the localized nature of basis functions 26 


(b) symmetries under geometric and other transformations 
[41[ 2^ , or (c) block-rank-sparsity due to near-sightedness 
of physical interactions [35| |34[ [3^ {i.e. blocks of the ma¬ 
trix/tensor are dense but have low-rank representations; in 
the applied math community related matrix structures are 
known as H matrices). Notably, the matrix structure is affili¬ 
ated with the problem-specific blocking of matrix dimensions 
that arises due to domain-specific needs and often cannot be 
chosen arbitrarily. In other words, the matrices that we en¬ 
counter are “sparse” in a general sense, which encompasses 
element, block, and block-level-rank sparsity; but in a prac¬ 
tical sense the matrices are not sparse enough to be a good 
match for sparse MM algorithms. Nonuniform blocking and 
inhomogeneity of data due to the matrix structure conflict 
with the uniform data distribution of standard distributed- 
memory dense MM algorithms — including Cannon’s [13| , 
SUMMA [^, and others m EH [1300 • 


To design algorithms for multiplication of matrices with 
an irregular structure, we propose a reformulation of stan¬ 
dard, dense MM algorithms that allows them to be toler¬ 
ant of data inhomogeneity by expressing them in terms of 





















tasks. By using the standard algorithm as the framework our 
approach will be potentially optimal for dense uniformly- 
blocked matrices, and yet capable of handling irregularity 
in the matrix structure. Task-based/datafiow programming 
models are a natural choice for implementation of algorithms 
with irregular data and computation patterns; such models 
have been used successfully for dense matrix algebra appli¬ 
cations [27[ 1^. Besides handling matrices with structure, 
a task based approach will provide additional benefits: (a) 
inter-node communication costs can be partially or fully hid¬ 
den by overlapping computation and communication, (b) 
performance should be less sensitive to topology and tol¬ 
erant of latency and CPU clock variations, (c) fine-grained 
task-based parallelism is a proven means to attain high intra¬ 
node performance by leveraging massively multicore plat¬ 
forms and hiding the costs of memory hierarchy (e.g. Intel 
TBB, Cilk), (d) lack of global synchronization allows the 
overlap multiple high-level stages of computation {e.g. mul¬ 
tiple matrix multiplications contributing to the same expres¬ 
sion) . 

This paper describes our approach, its implementation, 
and performance for matrix multiplication of square ma¬ 
trices with uniform and nonuniform blocking. Our imple¬ 
mentation is available as part of the open-source project 
TiledArray (a general-purpose tensor library) [12|. 


2. DISTRIBUTED-MEMORY MATRIX MUL¬ 
TIPLICATION 

This section contains an overview of the relevant literature 
on dense matrix multiplication algorithms for distributed 
memory computers. Our discussion does not consider fast 
algorithms [^, e.g. Strassen [^, that have computational 
complexity lower than Q{N^) of the naive algorithm, where 
N is the matrix size. Thus for our purposes, all algorithms 
are computation-optimal and only differ in the algorithm 
construction, communication patterns, and memory usage 
per node. 

An early parallel matrix-multiplication algorithm that has 
gained widespread use is Cannon’s algorithm. In this method, 
the input and output matrices are embedded onto a two- 
dimensional (2D) mesh of P processes (nodes), where the 
input matrices are moved via a series of row and column ro¬ 
tations in systolic loops Cannon’s algorithm asymptot¬ 
ically meets the lower bound on communication, Q,{N^/'/P) 
|30| , for 2D algorithms that require 0{N^/P) memory per 
process. The broadcast-multiply-roll algorithm of Fox et al, 
which is also asymptotically optimal, introduced the use of 
a (pipelined) broadcast for movement of one of the two in¬ 
put matrices 
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Unfortunately neither algorithm is 


universally applicable |50| |43| . Some of the limitations were 
removed in the Transpose algorithm of Lin and Snyder , 
and further by Choi and coworkers in Parallel Universal Ma¬ 
trix Multiplication Algorithm (PUMMA) [Tt] and by Huss- 
Lederman et al. in the BiMMeR algorithm [28| . Perhaps the 
simplest and easiest to generalize is a group of 2D algorithms 
utilizing row/column broadcasts followed by rank-fe updates. 
These algorithms were developed independently by several 
groups [2| |16[ [^ , most notably by van de Geijn and Watts 
who dubbed the approach Scalable Universal Matrix Mul¬ 
tiplication Algorithm (SUMMA) [^. The simplicity and 
flexibility of SUMMA made it very successful {e.g. it is the 
standard implementation of GEMM in ScaLAPACK), and 


has motivated several variants [14[ |23| |10| . In addition, the 
core ideas are also widely used in linear algebra algorithms 
[39} [46l [47| [m] . Since our work is based on SUMMA, it is 
described in more detail in Section (2.II 

All 2D algorithms described so far are optimal with re¬ 
spect to memory use. However, it is possible to reduce the 
communication cost further by using a three-dimensional 
(3D) mesh of processes at the cost of additional memory 
usage |30[ [4]. In 3D algorithms [T] , each matrix is repli¬ 
cated across one dimension of a yPx ^P x mesh. This 
data distribution reduces the communication cost by a fac¬ 
tor of but requires 0{N '^/data per node, a 

factor of P^^ increase over the 2D case. The gap between 
the 2D and 3D algorithms is bridged by the so-called 2.5D 
algorithms that are communication-optimal with and 
without memory constraints [^. 

Although 2D MM algorithms are not universally commu¬ 
nication-optimal, we view them as key for two reasons: (a) 
they are the only choice under tight memory constraints, 
which is often the case in data intensive applications, and 
(b) they are a building block for higher-dimensional MM 
algorithms 46 . Thus in our work we focus on SUMMA as 


the most popular 2D algorithm. 

2.1 2D SUMMA and Its Variants 

SUMMA implements matrix multiplication C = AB as 
a series of rank-fc updates. The input and output matrices 
are embedded on a rectangular process mesh in an element- 
cyclic or block-cyclic manner to ensure approximate load 
balance. In each iteration of the algorithm a column/row 
panel of A/B is broadcast along rows/columns of the process 
grid, respectively; matrix C remains stationary through¬ 
out the procedure (variants of SUMMA in which A or B 
are stationary are also possible; transposed multi plie s, e.g, 
C = ABt, are also relatively simple to handle 
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Each pair of broadcasts is followed by a rank-fe update, 
Cij AikBkj + Cij, (Einstein summation convention is used 
throughout). Original SUMMA papers by van de Geijn and 
Watts and by Agarwal et al. considered versions of 
the algorithm that overlapped computation and communica¬ 
tion by pipelining and preemptive broadcasts, respectively, 
and other broadcast variants have been considered 43 , in¬ 


cluding topology-specific broadcasts [^. Eor simplicity, we 
present a SUMMA version with preemptive broadcasts in 
Figure 


Algorithm 1 SUMMA with non-blocking broadcast 

1: Broadcast(A(*, 0),0,roui-grroup) 

2: Broadcast{B{ 0,*),0, col nroup) 

3: for fe = 0,..., A - 1 do 

4: if fe -I- 1 < A then 

5: rowATOot (fc -)- 1) mod cols 

6: col_root ■<— (fe -|- 1) mod rows 

7: Broadcast(A(*, k -|- l),rowAroot, row^group) 

8: BROADCAST(i3(fc -|- 1, *), coljroot, col_group) 

9: end if 

10: WAIT(A(*,fe)) 

11: WAIT(A(fe,*)) 

12: G(*, *) <— q:A(*, fe) • B(fe, *) -f C'(*, *) 

13: end for 


DIMMA, an early variation of SUMMA introduced by 













Choi in 1998, improved performance of synchronous SUMMA 
by reducing the slack in communication 14 15 . The key 


insight of this work was that the order of broadcasts in the 
original SUMMA implementation, coupled with the com¬ 
munication barriers, created a significant slack in commu¬ 
nication. Iterations were reordered in DIMMA such that 
each node broadcasts all of its data in succession as opposed 
to the round-robin approach in SUMMA. Similar improve¬ 
ments can be attained by overlapping communication and 
computation [^. 

SUMMA was recently extended to sparse MM (SpSUMMA) 
by Bulug and Gilbert 10 11 . This algorithm is similar 


to dense SUMMA, except matrix sub-blocks are stored in 
a doubly compressed sparse column (DCSC) format and 
sparse generalized matrix-multiplication (SpGEMM) is used 
to compute rank-A: updates. The main challenges of all 
sparse MM algorithms, including SpSUMMA, are the in¬ 
creased relative costs of communication compared to the 
dense case, load imbalance, and the relatively low intran¬ 
ode performance of sparse matrix kernels. 

The problem of load imbalance does not appear in dense 
SUMMA implementations as the work load is nearly-optimally 
balanced. With random sparsity approximate load balance 
is achieved in the asymptotic limit, however with structured 
sparsity (e.g. matrices with decay) one does not expect 
rows/columns to be uniformly filled. 

In the regime of high sparsity (low matrix fill-in factors) 
the communication time dominates the computation time 
due to the effectively-reduced benefit of blocking. The in¬ 
creasing role of communication in sparse MM can be some¬ 
what alleviated by communication hiding. Bulug and Gilbert 
discussed potential benefits of communication hiding for sparse 
MM in Ref. but did not pursue this approach in their 
experiments due to lack of quality one-sided communication 
tools [^. Another possibility is to minimize communica¬ 
tion, e.g. by switching to 3D [^. Although sparse MM 2D 
SUMMA is not strongly scalable, nevertheless good scala¬ 
bility of SpSUMMA in practice was demonstrated 111] . 


3. TASK-BASED 2D SUMMA 

Our work to improve SUMMA is motivated by the needs of 
computation on matrices/tensors with irregular block struc¬ 
ture in the context of many-body quantum physics. Some 
of the challenges of computing with such data are similar to 
the general challenges of sparse MM: increased communica¬ 
tion/computation ratio and lack of load balance. The latter 
is compounded by the desire to use physics-based blocking 
of matrix dimensions. To address these challenges we de¬ 
cided to investigate a task-based formulation of SUMMA 
algorithm, to partially offset some communication costs and 
to alleviate the load imbalance. Prior efforts to reformu¬ 
late dense matrix multiplication using a tasks-based pro¬ 
gramming models are known [^; the novelty of our ef¬ 
fort is the focus on dense matrices/tensors as well as ma¬ 
trices/tensor with irregular structure, such as block-rank- 
sparsity, that cannot be handled straightforwardly using the 
standard dense-only approaches. In this section we describe 
the design of our algorithm, by highlighting the differences 
with the procedural SUMMA implementations We first 
analyze the data dependencies of discrete operations in the 
procedural SUMMA implementation. We then discuss the 
task composition and dependencies of our modified imple¬ 
mentation. For simplicity, we only consider the 2D SUMMA 


implementation; our approach is immediately applicable to 
the 2.5D variant since it’s based on 2D SUMMA. 

3.1 Dependency Analysis of Standard SUMMA 

Like all dense MM algorithms, SUMMA consists of data 
movement and computation tightly synchronized with each 
other (see Algorithm]^. Namely, the rank-fc update, Cij <— 
AikBkj+Cij, of each SUMMA iteration depends on the data 
from the broadcasts of the corresponding panels of A and B 
as well as the previous iteration’s rank-fc update. In addition 
to these data dependencies, each broadcast is synchronized 
with a prior rank-fe update since communication operations 
are initiated at the beginning of each SUMMA iteration, as 
shown in Figure (we denote such sequence dependeneies 
by dashed lines). Such design ensures that only a minimal 
memory is used (technically, nonblocking broadcasts require 
more memory than optimal). However, such design limits 
the amount of parallelism available in a given iteration (see 
the Figure]^. Specifically, we can parallelize the rank-fc up¬ 
date of C as well as the column and row broadcasts of A 
and B, but SUMMA iterations — although almost indepen¬ 
dent from one another — are executed serially, one after the 
other. Furthermore, such design is not tolerant of any source 
of load imbalance, due to, for example, processor clock vari¬ 
ation, network congestion, or — most importantly for us — 
due to the inhomogeneity of data. 
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Figure 1: A directed acyclic graph of the procedural 
SUMMA implementation consisting of broadcast (BCAST) 
and rank-fc updates. Solid edges indicate data dependen¬ 
cies, and dashed edges indicate sequence dependencies. 


3.2 Task-Based Implementation 

To tolerate the data inhomogeneity, whether due to block 
sparsity or block-rank sparsity, the task queue of each pro¬ 
cesses should have as many tasks as possible at any given 
time. To maximize the amount of available parallelism, we 
introduced the following modifications to the standard 2D 
SUMMA: 

• data overdecomposition, 

• multiple-issue schedule of SUMMA iterations, and 

• task-decomposed broadcasts 

Although the standard dense SUMMA allows near-perfect 
load balancing by using element-cyclic embedding of matri- 



































ces onto 2D mesh, we over decompose matrices into arbitrar¬ 
ily sized blocks, also embedded cyclically. In fact the dimen¬ 
sion blocking is a central concept in TiledArray library 
since it arises naturally in physical problems (hence the need 
to support arbitrary block sizes, including nonuniform block¬ 
ing). These blocks should be sufficiently small to ensure 
that each process is assigned many blocks, yet large enough 
to perform acceptably with high performance BLASS li¬ 
brary (used to perform block-level multiplies). Note that 
the overdecomposition also improves intra-node scalability 
when coupled with an optimized, parallel runtime like Intel 
TBB, and helps improve load balance in dealing with ma¬ 
trices/tensors that have permutational symmetries, e.g. in 
the CAST algorithm [41| . 

Decoupling of SUMMA iterations helps to increase the 
amount of parallelism beyond what is afforded by a single 
SUMMA iteration. This is similar to extracting more par¬ 
allelism out of SUMMA by performing batches of SUMMA 
iterations in parallel on 2D subsets of a 3D process mesh 
in 2.5D MM algorithms except here different iterations 
of 2D SUMMA will be concurrently executed and sched¬ 
uled on the same 2D mesh. Note that the dependencies 
between SUMMA iterations occur via the result matrix C. 
To remove these data dependencies we assume that there is 
enough memory available to split the rank-A: update oper¬ 
ation into two separate tasks: a matrix-multiplication task 
producing a temporary block, = Aik ■ Bkj, and a re¬ 
duction of the temporary into the result, Cij = ^ + Cij. 

Figure]^ shows the directed acyclic graph (DAG) of task- 
based SUMMA with inter-iteration dependencies removed. 
By scheduling multiple iterations at once, we are able to 
overlap and interleave communication and computation of 
different iterations. Also note that, in principle, all itera¬ 
tions of SUMMA can potentially be scheduled concurrently. 
However, in practice we limit the number of multiple-issue 
scheduling to several iterations of SUMMA, and schedule ad¬ 
ditional iterations as the preceding iterations are retired, in a 
pipelined fashion. The additional task dependencies needed 
to implement this throttling mechanism are not shown in 
Figurej^ The number of concurrent iterations is determined 
by the dimensions of the process grid. For example, given a 
process grid with Prow rows and Pcoi columns, the number 
of concurrent iterations, I, is equal to, 

f 2 , if Prow < 2 or Pool < 2 

f (Prow, Pool ) A) = i A , if Prow > A and Pool > A (1) 

( min(Prow , Pool) 

where A is the number of blocks along the inner dimension 
of the matrix multiplication. This limit is based on perfor¬ 
mance prohling, where we observed the maximum perfor¬ 
mance at min(Prow, Pool). 

Lastly, we decomposed broadcast tasks into subtasks each 
of which deals with a point-to-point communication. This 
achieves benefits similar to what pipelining does to stan¬ 
dard 2D SUMMA [50[ |49| , by introducing the overlap be¬ 
tween communication and computation of tasks working on 
a given SUMMA iteration. For example, a given process can 
start doing compute work as soon as the minimal amount of 
data — one block of A and one block of B — is available, 
concurrently with any additional communication of these 
blocks to other processes in the row and column groups. 
Work scheduling based on the availability of data in our 



Figure 2: A direct translation of SUMMA to task-based 
implementation. 


task approach should improve tolerance to latency and net¬ 
work topology . In the current implementation we use a 
hardware-oblivious binary tree broadcast, i.e. it is not op¬ 
timized for any particular interconnect hardware or topol¬ 
ogy, such as a multidimensional torus or fat tree. While 
our design does not preclude the use of optimized broadcast 
algorithms, a hardware oblivious algorithm should help per¬ 
formance portability across high-end and commodity hard¬ 
ware. 

The dynamic scheduling of computation and communica¬ 
tion trades off the predictable bounds on the resource use, in 
particular memory and network bandwidth, for high perfor¬ 
mance. For example, by multiple-issue schedule of SUMMA 
iterations increases memory use because each node must 
store an additional column/row block of the the argument 
matrices. Additional temporary storage might be used by 
matrix-multiplication producing temporaries subsequently 
reduced into the blocks of the target. The average mem¬ 
ory overhead per node per SUMMA iteration, as depicted 
in Figure is proportional to: 
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( 2 ) 


where k is the average block size of the inner dimension; m 
and n are the average number of rows and columns in the 
result matrix blocks, respectively; M and N are the num¬ 
ber of block rows and block columns in the result matrix, 
respectively; and Prow and Pcoi are the number of rows and 
columns in the process grid, respectively. However, we mit¬ 
igate the memory overhead associated with the local result 
matrix by decomposing the rank-fc-update tasks such that 
each matrix-multiplication task only computes a small sub¬ 
block of the local result matrix (see Figure]^. This decom¬ 
position of the rank-fc-update and broadcast operations are 
similar to the decomposition of work between nodes within a 
SUMMA iteration, but without the analogous communica- 
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Figure 3: A directed acyclic graph for a single, decomposed 
SUMMA iteration. Broadcast, matrix multiply, and reduc¬ 
tion tasks are decomposed such that each sub-block of the 
local result matrix, depends on two sub-block broad¬ 

cast tasks for A^^^ and B^^^^. 


tion (dependencies) between computation tasks. This allows 
different threads to work on different sub-blocks of the lo¬ 
cal result matrix concurrently. The maximum memory over¬ 
head is therefore bounded by the number of threads and only 
weakly dependent on the number of concurrent iterations. 
We further mitigate memory costs by combining the matrix- 
multiplication and reduction tasks whenever possible, which 
eliminates the memory and computational overhead of a 
separate reduction task. As a result, matrix-multiplication 
tasks only allocate additional, temporary memory for the 
result sub-blocks when two or more threads compute contri¬ 
butions to the same sub-block simultaneously. The memory 
cost associated with the left- and right-hand matrices is also 
mitigated by the overdecomposition of data. This is due to 
the fact that each row and column sub-block only remains in 
memory as long as it is needed, which is not necessarily for 
the entire duration of computation for an iteration. Finally, 
the overall memory consumption can be controlled by lim¬ 
iting the number of concurrent SUMMA iterations that are 
executed on each node. If I SUMMA iterations are sched¬ 
uled at once each process’s memory potentially increases 
over that of standard SUMMA by a factor of I ; although as 
explained the resulting increase would in practice be much 
less than that. 


4. RESULTS 

The task-based 2D SUMMA algorithm is implemented in 
C-f-l- in the Tiled Array library [^. Tiled Array uses 
the MADNESS runtime to express the computation 
in terms of tasks and to manage the low-level details of 
task scheduling and data movement. Both TiledArray 
and MADNESS can be obtained under the terms of the 
GNU General Public License. 


ity of the performance analysis and is not a constraint of our 
implementation. The Basic Linear Algebra Subprograms 
(BLAS) DGEMM routine was used as the block multiply- 
add kernel in our MM tasks. On the BG/Q system, we 
use the BLAS functions in IBM’s Engineering and Scien¬ 
tific Subroutine Library (ESSL). Intel’s Math Kernel Li¬ 
brary (MKL) is used on the small, x86_64 Linux cluster. 
Although concurrent versions of DGEMM are available in 
both libraries, we use serial DGEMM to properly allocate 
resources with the MADNESS task-based parallel runtime 
system; this limitation will be removed in the near future. 
Reported wall times are an average of n repeated multipli¬ 
cations of two random matrices, where n = 30 for most 
computations except when run time limits constrained the 
sample sizes. 

Block size for the uniform MM tests are optimized for each 
system, where we select the block size with the best perfor¬ 
mance from a series of single-process MM tests with a range 
of block sizes. To determine the size of blocks in the nonuni¬ 
form MM tests, we first start by constructing M empty, row 
blocks, where M is equal to the number of row blocks in the 
uniformly-blocked matrices. We then randomly one to the 
size a row block, and repeat this step until the total number 
of rows among all blocks is equal to the number of rows in 
the uniformly blocked matrices. We repeat procedure for 
column block sizes. This ensures the average block size, as 
well as the number of blocks per matrix, in the nonuniform 
MM tests are equal to that of the uniformly-blocked matri¬ 
ces. 

4.2 Performance of Task-Based SUMMA 
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Figure 4: Floating-point operation rate for multiplication 
of uniformly- and nonuniformly-blocked square matrices on 
IBM BG/Q. The matrix sizes included in the test are 32, 768, 
65,536, 98,304, 256,000. 


4.1 Experimental Setup 

In our experiments, we compare the performance of the 
task-based SUMMA implementation for matrix multiplica¬ 
tion of square uniformly- and nonuniformly-blocked real double¬ 
precision matrices; the square shape was chosen for simplic- 


Performance measurements on high-end hardware were 
performed on “Mira,” an IBM Blue Gene/Q (BG/Q) sys¬ 
tem at Argonne National Laboratory. For each matrix size 
we varied the number of compute nodes from 2® and 2^^ 
(each node has 16 compute cores with 4 hardware threads 
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Figure 5: Wall time for multiplication of uniformly- and 
nonuniformly-blocked square matrices on IBM BG/Q. The 
matrix sizes included in the test are 32, 768, 65, 536, 98, 304, 
256,000. 


per core; thus the largest computation utilized 2^^ ~ 10® 
threads). 

In Figure]^ and 1^ we see an approximately linear increase 
in computational rate for each matrix size. The combine 
data for all matrix sizes also shows approximately linear 
weak scaling in the entire range of processor counts. The 
performance difference between the uniform and nonuniform 3 
tests is small and follow each other closely, though the tests 
with uniform block sizes are consistently faster than those 
with nonuniform block sizes. In fact, slopes for both the 
uniform and nonuniform rate tests are approximately equal. 

This suggests that the performance difference between these 
two tests is due to an intra-node effect, rather than a scaling 
issue that one would expect to see with a signihcant load im¬ 
balance. This difference is an expected result since DGEMM 
performance is sensitive to the matrix size for small matri¬ 
ces. 

We also perform a strong scaling test on a small, hfteen- 
node cluster with two eight-core 2.60 GHz Intel Xeon E5- 
2670 processors (16 cores total) and 16 GB of RAM per 
node, with an Inhniband switch. We use a matrix size of 
32, 768 for all tests on this system. The block size is set to 
256. Similarly, the average number of rows and columns per 
block in the nonuniform test was 256 for each dimension. 

The number of compute cores in this test set varies from 16 
to 240. 

Like the tests on the BG/Q system, the data shows an ap¬ 
proximately linear increase in computational rate for both 
uniform and nonuniform tests (see Figures]^ and [^. Unlike 
the previous tests, the performance difference between the 
uniform and nonuniform tests is negligibly small; neither 
set of tests clearly out performs the other. For example, 
the mean computational times of the fifteen-node, uniform- 
and nonuniform-matrix tests are 21.1716 and 20.8096 sec¬ 
onds, respectively, with standard deviations of 0.663725 and 


Figure 6: Floating operation throughput for multiplication 
of uniformly- and nonuniformly-blocked square matrices of 
size 32, 768 on a commodity cluster. 
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Figure 7: Wall time for multiplication of uniformly- and 
nonuniformly-blocked square matrices of size 32, 768 on a 
commodity cluster. 


0.544411 seconds, respectively. The mean values of these 
data points are within one standard deviation of each other, 
and other data points with the same number of processes 
are similarly close. 

4.3 Efficiency of Task-Based SUMMA 

The performance and efficiency of our task-based SUMMA 
implementation was found to be highly dependent on that 
of the matrix multiplication kernel, in this case DGEMM 
(Tiled Array supports matrices of standard single- and double- 











precision real and complex types as well as user-defined 
types). The best, single-node performance we achieved with 
ESSL is 99.1393 GFLOPS, which is 59.7% of Rmax (166.06 
GFLOPS per-node) and 48.4% of Rpeak (204.799 GFLOPS 
per-node). Typically, with a well optimized BLAS library, 
single-node performance of our task-based SUMMA imple¬ 
mentation can achieve 90% to 95% of theoretical peak. For 
example, the single node performance on our small cluster, 
using MKL, was 302.787 GFLOPS, which is 90.98% of the 
theoretical peak performance computed from the base CPU 
frequency (332.8 GFLOPS per node). The discrepancy in 
percent of machine peak between these two systems is due 
to the limited thread scalability of ESSL. Anecdotally, 50% 
of machine peak is considered “good” performance for ESSL 
BLAS Level-3 functions with 16 cores. 

In Figure]^ we show the percent of efficiency of our algo¬ 
rithm on the IBM BG/Q system, with 100% efficiency set 
at the maximum measured performance on a single node 
(99.1393 GFLOPS). The data shows that usually greater 
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Figure 8: Computational efficiency for multiplication of 
uniformly- and nonuniformly-blocked square matrices on 
IBM BG/Q. One hundred percent efficiency refers to 
the peak, measured performance of a single node, with 
Tiled Array. 


than 75% efficiency is maintained across a variety of matrix 
sizes and number of cores, relative to the single node per¬ 
formance. The efficiency drops down to as low as 50% in 
the strong scaling limit, when each processor core has 4 or 
fewer target matrix blocks. The granularity of tasks lim¬ 
its the maximum amount of parallelism achievable for each 
matrix size, which causes a steep drop in efficiency in the 
right-most tail of each data set. 

4.4 Load Variability with Nonuniform Block¬ 
ing 

The randomly sized matrix blocks in these tests are in¬ 
tended to simulate the irregular blocking structure of data 
in quantum physics problems. We used a low-quality ran¬ 
dom number generator to create significant inhomogeneity in 


the blocks sizes. The ratio of minimum to maximum mem¬ 
ory and work loads for these generated matrices are given in 
Table [T] 


Matrix Size 

Memory 

MiniMax 

Work 

Min:Max 

32768 

1:2.99 

1:4.46 

65536 

1:3.54 

1:5.77 

98304 

1:3.27 

1:5.51 

256000 

1:3.93 

1:7.12 


Table 1: Ratio of the minimum to maximum memory and 
work loads for nonuniformly-blocked matrices. 


Despite the substantial inhomogeneity at the block level, 
because each process hold many blocks due to data overde¬ 
composition, the effective load imbalance in the computa¬ 
tion and communication is somewhat smaller than what is 
reported in Table For example, with a matrix size of 
32, 768 and 256 processes, we see a ratio of 1:1.35 between 
processes with the smallest and largest memory usage. 

5. CONCLUSIONS 

We presented a fine-grained task-based reformulation of 
Scalable Universal Matrix Multiplication Algorithm (SUMMA). 
In conjunction with multiple-issue scheduling of SUMMA it¬ 
erations, the new formulation should be tolerant of data in¬ 
homogeneity due to matrix structure. The implementation 
of our algorithm in TiledArray library performed equally 
well for multiplication of square matrices with uniform and 
nonuniform blocking (the latter designed to simulate the 
domain-specific blocking structure characteristic of the elec¬ 
tronic structure domain). The implementation scales well on 
a small commodity cluster as well as a high-end IBM BG/Q 
supercomputer, with typical measured efficiencies relative to 
the single node performance of 75% on as many as 2^® cores. 
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